WindowMasker: window-based masker for sequenced genomes

Authors

  • Aleksandr Morgulis
  • E. Michael Gertz
  • Alejandro A. Schäffer
  • Richa Agarwala
Abstract

MOTIVATION: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes.

RESULTS: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis.

AVAILABILITY: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build.

SUPPLEMENTARY INFORMATION: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
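
The abstract attributes WM's speed to a few linear-time scans of the genome rather than alignment against a repeat library. The sketch below illustrates that general idea only; it is not WindowMasker's actual algorithm, and the function names, the fixed k-mer length, the window size and the mean-count threshold are all assumptions chosen for illustration. One scan counts every k-mer in the sequence; a second scan slides a window over the k-mers and soft-masks (lowercases) the bases of any window whose average k-mer count reaches the threshold.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """First scan: count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_repeats(seq: str, k: int = 4, window: int = 3,
                 threshold: float = 3.0) -> str:
    """Second scan: lowercase every base covered by a window of
    `window` consecutive k-mers whose mean count is >= `threshold`.

    All parameter defaults are illustrative assumptions, not values
    taken from WindowMasker.
    """
    upper = seq.upper()
    counts = count_kmers(upper, k)
    n_kmers = len(upper) - k + 1
    if n_kmers < window:
        return upper  # too short to score a single window

    kmer_counts = [counts[upper[i:i + k]] for i in range(n_kmers)]
    masked = [False] * len(upper)
    masked_until = 0                      # end (exclusive) of last masked span
    running = sum(kmer_counts[:window])   # sum of counts in current window

    for start in range(n_kmers - window + 1):
        if start > 0:
            # Slide the window by one k-mer in O(1).
            running += kmer_counts[start + window - 1] - kmer_counts[start - 1]
        if running / window >= threshold:
            end = start + window - 1 + k  # bases spanned by this window
            for j in range(max(start, masked_until), end):
                masked[j] = True          # each base is marked at most once
            masked_until = end

    return "".join(c.lower() if m else c for c, m in zip(upper, masked))


if __name__ == "__main__":
    genome = "ACGTAGC" + "AT" * 10 + "GCATTCG"
    print(mask_repeats(genome))
```

Because the window sum is updated incrementally and each base is marked at most once, both passes stay linear in the sequence length, which is the property the abstract contrasts with library-based alignment.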




Journal:
  • Bioinformatics

Volume 22, Issue 2

Publication date: 2006